Optimisation concerns search for a maximum (or minimum) of a function.
An optimisation problem can be expressed quite compactly as follows: \[ \max_{x\in\mathcal{D}} f(x) \quad\text{or}\quad \underset{x\in\mathcal{D}}{\operatorname{argmax}}f(x),\] where \(\mathcal{D}\) is a subset of the function’s domain, over which we try to find a maximiser.
The former is the (global) maximum, while the latter is a set of (global) maximisers.
In other words, a maximiser \(x^*\in\operatorname{argmax}f(x)\) satisfies \[ f(x)\leq f(x^*),\; \forall x\in\mathcal{D}. \]
Any minimisation problem can be transformed into a maximisation problem: \[ \underset{x\in\mathcal{D}}{\operatorname{argmin}} f(x) = \underset{x\in\mathcal{D}}{\operatorname{argmax}} -f(x).\]
Global maximisers are generally difficult to find because our knowledge of \(f\) is only local.
We can evaluate \(f\) at only a finite number of points and obtain a partial picture of the overall shape of \(y=f(x)\).
In one-dimensional cases \((\mathcal{D}\subset\mathbb{R})\), partial pictures may be sufficient for locating global maxima.
For example, it is easy to evaluate \[ f_1(x)=\sin(3\pi x)/x+x^3-5x^2+6x \] at 301 points, interpolate them, and obtain the following graph.
This is easy because computation of the objective function is cheap.
Some functions in practice are very expensive to compute, taking hours per evaluation (e.g., large-scale simulations).
Modern optimisation problems are often high-dimensional (e.g., deep neural networks).
In such cases, exhaustive search is infeasible and we have to be content with local maxima.
Recall that a point \(x^*\) is a local maximiser if there exists a neighbourhood \(\mathcal{N}\subset\mathcal{D}\) of \(x^*\) such that \(f(x)\leq f(x^*)\)for all \(x\in\mathcal{N}\).
We use two maximisation problems to test our algorithms. The first one: \[ \max_{x\in\mathcal{D_1}} f_1(x) = \max_{x\in\mathcal{D_1}} \frac{\sin(3\pi x)}{x}+x^3-5x^2+6x . \] where \(\mathcal{D_1} = [0.3, 3.5]\).
fgh1 <- function(x) {
f <- sin(3*pi*x)/x+x^3-5*x^2+6*x
grad <- (3*pi*x*cos(3*pi*x) - sin(3*pi*x))/x^2 + 3*x^2 - 10*x + 6
H <- -(9*pi^2*x^2*sin(3*pi*x) + 2*(3*pi*x*cos(3*pi*x) - sin(3*pi*x)))/x^3 + 6*x - 10
return(c(f,grad,H))
}The second one: \[ \max_{(x,y)\in\mathcal{D_2}} f_2(x,y) = \max_{(x,y)\in\mathcal{D_2}} \sin\left(\frac{x^2}{2} - \frac{y^2}{4}\right)\cos\left(2x-e^y\right). \] where \(\mathcal{D_2} = [-0.5, 3]\times[-0.5, 2]\). \(y = f_2(x)\) looks like the following.
f2 <- function(x, y) sin(x^2/2 - y^2/4)*cos(2*x-exp(y))
XY <- meshgrid(seq(-.5, 3, length.out=100), seq(-.5, 2, length.out=100))
Z <- f2(XY$X, XY$Y)
plot_ly(x=XY$X, y=XY$Y, z=Z, showscale=F) %>% add_surface()fgh2 <- function(x) {
f <- sin(x[1]^2/2-x[2]^2/4)*cos(2*x[1]-exp(x[2]))
g1 <- x[1]*cos(2*x[1]-exp(x[2]))*cos(x[1]^2/2-x[2]^2/4) -
2*sin(2*x[1]-exp(x[2]))*sin(x[1]^2/2-x[2]^2/4)
g2 <- -(x[2]*cos(2*x[1]-exp(x[2]))*cos(x[1]^2/2-x[2]^2/4))/2 +
exp(x[2])*sin(2*x[1]-exp(x[2]))*sin(x[1]^2/2-x[2]^2/4)
h11 <- cos(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) -
x[1]^2*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1]) +
4*x[1]*sin(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) -
4*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1])
h12 <- -x[1]*exp(x[2])*sin(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) -
x[2]*sin(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) +
2*exp(x[2])*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1]) +
x[1]*x[2]*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1])/2
h22 <- -exp(x[2])*sin(exp(x[2])-2*x[1])*sin(x[1]^2/2-x[2]^2/4) -
cos(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4)/2 -
x[2]^2*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1])/4 +
exp(x[2])*x[2]*sin(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) -
exp(2*x[2])*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1])
return(list(f, c(g1,g2), matrix(c(h11,h12,h12,h22),2,2)))
}This is nasty, but we need this information.
Based on the same “bracketing” idea used in the bisection method for root finding.
Derivative is not required.
We successively isolates smaller intervals (brackets) that contain a local maximum.
To isolate a local maximum, we use three points: \(x_l\), \(x_r\), and \(x_m\) such that \(x_m\in(x_l,x_r)\). A key observation is: \[ f(x_l) < f(x_m) \quad\text{and}\quad f(x_r) < f(x_m) \] implies existence of a local maximiser in the interval \([x_l,x_r]\).
The idea at #3 is that we want to shrink the bracket while keeping the estimate of the maximiser in the bracket. If \(f(y)<f(x_m)\), we cannot update the estimate but still want to shrink the bracket.
At step #2 above, the golden ratio arises when our motivation is to choose \(y\) so that the ratio of the lengths of the larger to the smaller subintervals remains the same.
“It can be shown that this rate of convergence is optimal in the sense that if you choose \(y\) any other way, then, in the worst-case, the bracketing interval will converge more slowly.” (p.207, JMR)
curve(-x^2, xaxt="n", yaxt="n", ann=FALSE, from=-1, to=1)
letters <- c(expression(x[l]),expression(x[m]),"y",expression(x[r]))
from_x <- c(-.9,-.1,.25,.9)
axis(1, from_x, labels=letters)
from_y <- rep(-1.2, 4)
to_x <- from_x
to_y <- sapply(from_x, function(x) return(-x^2))
segments(from_x, from_y, to_x, to_y, lty=3)
arrows(from_x[1], -.9, to_x[2], -.9, length=.1, code=3)
text((to_x[1]+to_x[2])/2, -.9, "a", pos=3)
arrows(from_x[2], -.9, to_x[4], -.9, length=.1, code=3)
text((to_x[2]+to_x[4])/2, -.9, "b", pos=3)
arrows(from_x[2], -.6, to_x[3], -.6, length=.1, code=3)
text((to_x[2]+to_x[3])/2, -.6, "c", pos=3)
points(to_x[2:3], to_y[2:3])
text(to_x[2], to_y[2]+.01, expression(f(x[m])), pos=2)
text(to_x[3], to_y[3]+.01, expression(f(y)), pos=4)
title("Illustration of the golden ratio")The right subinterval is larger, so \(I = (x_m,x_r)\) (#1). Then, pick \(y\in I\) (equivalently, pick \(c\)) such that \(\frac{a}{c} = \frac{b}{a}\) as \(f(y)<f(x_m)\) (#2).
Instead, if below is the situation, where we get a better estimate of the maximum, then pick \(y\) (or \(c\)) such that \(\frac{b-c}{c} = \frac{b}{a}\) as \(f(y)\geq f(x_m)\).
curve(-x^2, xaxt="n", yaxt="n", ann=FALSE, from=-1, to=1)
letters <- c(expression(x[l]),expression(x[m]),"y",expression(x[r]))
from_x <- c(-.9,-.2,.1,.9)
axis(1, from_x, labels=letters)
from_y <- rep(-1.2, 4)
to_x <- from_x
to_y <- sapply(from_x, function(x) return(-x^2))
segments(from_x, from_y, to_x, to_y, lty=3)
arrows(from_x[1], -.9, to_x[2], -.9, length=.1, code=3)
text((to_x[1]+to_x[2])/2, -.9, "a", pos=3)
arrows(from_x[2], -.9, to_x[4], -.9, length=.1, code=3)
text((to_x[2]+to_x[4])/2, -.9, "b", pos=3)
arrows(from_x[2], -.6, to_x[3], -.6, length=.1, code=3)
text((to_x[2]+to_x[3])/2, -.6, "c", pos=3)
points(to_x[2:3], to_y[2:3])
text(to_x[2], to_y[2]+.01, expression(f(x[m])), pos=2)
text(to_x[3], to_y[3]+.01, expression(f(y)), pos=4)
title("Illustration of the golden ratio")Given \(I = (x_m,x_r)\), two situations can arise from our pick of \(y\) and we do not know which, because \(f(y)\) is unknown before picking \(y\) and evaluating \(f(y)\).
Our motivation is to pick \(y\) no matter which scenario happens to be the case.
Formally, pick \(y=c+x_m\) such that \[ \frac{a}{c} = \frac{b}{a} \quad \text{and} \quad \frac{b-c}{c} = \frac{b}{a} .\] Or, simply \[c = b - a.\]
Plugging this \(c\) in either equation, we get \[\begin{gather} \frac{b}{a} = \frac{a}{b-a}\\ \Leftrightarrow\quad \left(\frac{b}{a}\right)^2-\left(\frac{b}{a}\right)-1 = 0 \end{gather}\] A positive root is the golden-ratio \[\frac{b}{a} = \frac{1+\sqrt{5}}{2}.\]
In short, given three points \(x_l\), \(x_m\), and \(x_r\), we choose \[\begin{align} y &= x_m + c\\ &= x_m + \frac{a^2}{b}\\ &= x_m + \frac{(x_m-x_l)}{(1+\sqrt{5})/2}. \end{align}\]
Try to see what will happen if the left subinterval is larger and \(I = (x_l,x_m)\).
Here is an implementation of the algorithm.
gsection <- function(f, x.l, x.r, x.m, tol=1e-8) {
gr <- (1 + sqrt(5))/2
f.l <- f(x.l)[1]
f.r <- f(x.r)[1]
f.m <- f(x.m)[1]
# Exception
if (f.l >= f.m) {
return("ERROR: f(x.l) < f(x.m) required.")
} else if (f.r >= f.m) {
return("ERROR: f(x.r) < f(x.m) required.")
}
# Main
n <- 0
while ((x.r - x.l) > tol) {
n <- n + 1
if ((x.r - x.m) > (x.m - x.l)) {
y <- x.m + (x.m - x.l)/gr
f.y <- f(y)[1]
if (f.y >= f.m) {
x.l <- x.m
f.l <- f.m
x.m <- y
f.m <- f.y
} else {
x.r <- y
f.r <- f.y
}
} else {
y <- x.m - (x.r - x.m)/gr
f.y <- f(y)[1]
if (f.y >= f.m) {
x.r <- x.m
f.r <- f.m
x.m <- y
f.m <- f.y
} else {
x.l <- y
f.l <- f.y
}
}
}
return(list("x"=x.m, "f"=f(x.m)[1], n=n))
}Apply it to \(f_1(x)=\sin(3\pi x)/x+x^3-5x^2+6x\) with initial \((x_l,x_r,x_m)=(1.1,2.1,1.6)\).
gsection(fgh1, 1.1, 2.1, 1.6)## $x
## [1] 1.455602
##
## $f
## [1] 1.851551
##
## $n
## [1] 39
Try other initial triples and see which local maxima you find.
Thanks to the popularity of deep learning, gradient descent has been studied extensively as a class of optimisation methods.
A gradient \(\nabla f\) is a vector of partial derivatives of \(f\), containing information about the direction of fastest increase in \(f\) (as well as the rate of increase).
When searching for a local maximum, it is almost natural, albeit naive, to follow the direction of gradient at each iteration.
Hence, the algorithm. \[ x_{n+1} = x_n + \alpha \nabla f(x_n).\]
\(\alpha\) is a step size that determines how far we move in the direction of \(\nabla f(x_n)\).
Unless \(f\) is linear, \(\nabla f\) is a function of \(x\) and \(\nabla f(x_n)\) will not tell the steepest direction once we move out of \(x_n\). So, as a rule of thumb, we should take a small step at a time.
Let’s first apply it to the 1-D problem.
gd1 <- function(fg, x0, alpha=1e-2, tol=1e-6, n.max=1e5) {
x <- x0
fg.x <- fg(x)
n <- 0
while ((abs(fg.x[2]) > tol) & (n < n.max)) {
x <- x + alpha*fg.x[2]
fg.x <- fg(x)
n <- n + 1
}
return(list("x"=x, "f"=fg(x)[1], n=n))
}Note the stopping criteria. Besides the maximum number of iterations, it uses \(|\alpha f'(x)|\leq\epsilon\) because that is an actual step to be taken and \(f'(x)=0\) at a local maximum.
Try other stopping criteria such as \(|f(x_{n+1})-f(x_n)|\leq\epsilon\) and \(\|x_{n+1}-x_n\|\leq\epsilon\).
gd1(fgh1, 1.3)## $x
## [1] 1.455602
##
## $f
## [1] 1.851551
##
## $n
## [1] 19
Identify all local maxima using different initial points. You can eyeball the graph and determine points from which to hill-climb. What will happen if you start at \(x\leq0.4\) or \(x\geq3.1\)?
Next, the 2-D problem using \(x_0=(1.6, -0.3)\).
gd2 <- function(fg, x0, alpha=1e-2, tol=1e-6, n.max=1e5) {
x <- x0
grad <- fg(x)[[2]]
n <- 0
while ((max(abs(grad)) > tol) & (n < n.max)) {
x <- x + alpha*grad
grad <- fg(x)[[2]]
n <- n + 1
}
return(list("x"=x, "f"=fg(x)[[1]], n=n))
}gd2(fgh2, c(1.6, -.3))## $x
## [1] 2.030697 1.401526
##
## $f
## [1] 1
##
## $n
## [1] 1102
But, if using \(x_0=(-0.2, 0.3)\),
To see the issue, let’s examine the following situation, where \(f'(x) = -4x^3+33x^2-76x+40\), which implies \(f'(x)=0\) at \(x\in\{0.75,2.93,4.57\}\).
Suppose \(x_n=0\), where we have \(f'(0)=40\). Thus, \(x_{n+1} = 40\alpha\) implying that, if we use \(\alpha>0.074\), we will have \(x_{n+1}>2.93\), miss the first local maximum, and climb up towards the second local maximum \(x^*=4.57\).
How can we know \(\alpha=0.08\) is too large?
The issue is much more difficult to handle in higher dimensions, where there are the infinite number of directions to choose from and we could circle around so long before approaching any local maximum.
Of course, if we always use a very small \(\alpha\), progress might be too slow to be practical.
We should try out multiple \(\alpha\) values to find the best, which sounds like another optimisation problem, doesn’t it?
One solution to the step size selection is to choose: \[ \alpha^* \in \underset{\alpha\geq 0}{\operatorname{argmax}} g(\alpha) = \underset{\alpha\geq 0}{\operatorname{argmax}} f(x_n + \alpha \nabla f(x_n)) . \] This is another maximisation problem. But, it is a 1-D problem and relatively easy to solve.
Remember that this maximisation problem with respect to \(\alpha\) is solved at each iteration within the origianl maximisation problem.
Here, we try to find a maximiser \(\alpha^*\) using brute-force search, which is naively simple but often effective in 1-D problems. Specifically, generate 1,000 \(\alpha\)s equally-spaced between \(0\) and \(0.1\), evaluate \(g\) at each \(\alpha\), and return \(\alpha^*\) such that \(g(\alpha^*)\) is the largest among 1,000 \(g(\alpha)\)s.
Brute-force optimisation is out of the question in high-dimensional problems. But, even in 1-D problems, it depends and may not be feasible if a search space is large or fine-grid search is required (e.g., \(\alpha=10^{-5}\) and \(\alpha=10^{-6}\) make significant difference.)
First, the 1-D problem.
line.search1 <- function(fg, x0, tol=1e-6, n.max=1e5) {
g <- function(alpha, x, grad) fg(x + alpha*grad)[1]
x <- x0
grad <- fg(x)[2]
# Brute-force search in alpha
Alphas <- seq(0, .1, length.out=1e3)
Gs <- sapply(Alphas, g, x=x, grad=grad)
alpha <- Alphas[which.max(Gs)]
n <- 0
while ((abs(grad) > tol) & (n < n.max)) {
x <- x + alpha*grad
grad <- fg(x)[2]
# Brute-force search in alpha
Alphas <- seq(0, .1, length.out=1e3)
Gs <- sapply(Alphas, g, x=x, grad=grad)
alpha <- Alphas[which.max(Gs)]
n <- n + 1
}
return(list("x"=x, "f"=fg(x)[[1]], n=n))
}line.search1(fgh1, 1.3)## $x
## [1] 1.455602
##
## $f
## [1] 1.851551
##
## $n
## [1] 3
Next, the 2-D problem.
line.search2 <- function(fg, x0, tol=1e-6, n.max=1e5) {
g <- function(alpha, x, grad) fg(x+alpha*grad)[[1]]
x <- x0
grad <- fg(x)[[2]]
# Brute-force search in alpha
Alphas <- seq(0, .1, length.out=1e3)
Gs <- sapply(Alphas, g, x=x, grad=grad)
alpha <- Alphas[which.max(Gs)]
n <- 0
while (max((abs(grad)) > tol) & (n < n.max)) {
x <- x + alpha*grad
grad <- fg(x)[[2]]
# Brute-force search in alpha
Alphas <- seq(0, .1, length.out=1e3)
Gs <- sapply(Alphas, g, x=x, grad=grad)
alpha <- Alphas[which.max(Gs)]
n <- n + 1
}
return(list("x"=x, "f"=fg(x)[[1]], n=n))
}line.search2(fgh2, c(1.6, -.3))## $x
## [1] 2.030697 1.401526
##
## $f
## [1] 1
##
## $n
## [1] 108
While ignoring the cost for inner optimisation, we see significant reduction in the number of steps taken to reach local maxima.
In 1-D problems, Newton’s method is simply to apply the Newton-Raphson method to find a root of \(f'\).
This makes sense because \(f'(x)=0\) at a local maximum.
Assuming that \(f\) is twice continuously differentiable, we have: \[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} .\]
To generalize it to multi-dimensional problems, let’s first recall the key idea used in the Newton-Raphson method.
When iteratively searching for a root of \(g\), the method uses linear approximation of \(g\) at the current guess \(x_n\) as it is easy to find a root of linear function. Then, the approximated root becomes our next guess \(x_{n+1}\).
Now, imagine a vector-valued function with \(k\) inputs and \(k\) outputs: \(g:\mathbb{R}^k\to\mathbb{R}^k\). In this case, the equivalents of root and derivative are respectively \(0\) vector and the Jacobian matrix (\(J\)). Doing the same approximation at \(x_n\in\mathbb{R}^k\), we get: \[ g(x) \approx g(x_n) + J(x_n)(x-x_n) .\] Setting the left-hand side equal to \(0\) and rearranging, we derive: \[ x_{n+1} = x_n - J(x_n)^{-1}g(x_n) .\]
Assuming \(J\) has a non-zero determinant at \(x_n\).
Based on this, if we assume the objective function \(f\) is twice continuously differentiable, then we have multi-dimensional Newton’s method: \[ x_{n+1} = x_n - H(x_n)^{-1}\nabla f(x_n) ,\] where \(H\) is the Hessian matrix of \(f\).
To see the correspondence, remember that for \(f:\mathbb{R}^k\to\mathbb{R}\), \(\nabla f\) is a vector-valued function and its Jacobian is the Hessian of \(f\).
Another derivation is:
Let’s apply Newton’s method to the 2-D testbed.
newton <- function(fgh, x0, tol=1e-6, n.max = 100) {
x <- x0
grad <- fgh(x)[[2]]
H <- fgh(x)[[3]]
n <- 0
while ((max(abs(grad)) > tol) & (n < n.max)) {
x <- x - solve(H, grad)
grad <- fgh(x)[[2]]
H <- fgh(x)[[3]]
n <- n + 1
}
return(list("x"=x, "f"=fgh(x)[[1]], n=n))
}newton(fgh2, c(1.6, -.3))## $x
## [1] 0.02637089 5.30394570
##
## $f
## [1] -0.681151
##
## $n
## [1] 9
What happened?
Newton’s method is extremely sensitive to the choice of \(x_0\).
Since the method is based on root finding of \(\nabla f\), it may converge to wherever the gradient vanishes: minima, maxima or saddle points.
When Newton’s method works, its convergence is quick, often using only a fraction of steps taken by line search. For example, it works for \[ f(x,y) = x(x-1)(x+1) - y(y-1)(y+1) \] plotted below. The price for speed is the requirement of second derivatives, which may be infeasible to compute in complex problems.
f3 <- function(x, y) x*(x-1)*(x+1) - y*(y-1)*(y+1)
XY = meshgrid(seq(-1, 1, length.out=100), seq(-1, 1, length.out=100))
Z <- f3(XY$X, XY$Y)
plot_ly(x=XY$X, y=XY$Y, z=Z, showscale=F) %>% add_surface()